Predictors: body mass interacting with habitat, effect of >10 kg, (maybe) diet.
mammal_bmr <- read.csv("data/bmr.csv")
logbmr is the base-10 logarithm of the BMR in ml \(O_2\) per hourLog.of.BMR is the base-10 logarithm of the BMR in kcal/dayLog.of.Mass or logmb is the base-10 logarithm of the mass in kgRename some variables and create species name from genus and species. (There is already a variable in the dataset Species.Name but just to be safe and 100% sure of consistency with the genus/species values that may be used in random effects models, building it here.)
mammal_bmr <- mammal_bmr %>%
rename(source = Source,
order = order.corrected,
# n.animals = Number.of.Animals,
mass.kg = body.mass..kg.) %>%
mutate(order = str_to_title(order),
genus = str_to_title(genus),
animal = paste(genus, species),
above.10kg = ifelse(mass.kg >= 10, 'Larger', 'Smaller'))
Keep only variables we will be using. And “factor” “chr” variables.
mammal_bmr <- mammal_bmr %>%
dplyr::select(order,
genus,
species,
common.name,
mass.kg,
log10.body.mass,
log10.bmr,
diet,
habitat,
source,
animal,
above.10kg
) %>%
mutate(across(where(is.character), factor)) %>%
arrange(order, genus, species)
A few quick graphs just to make sure the data are looking as we expect (error checking).
my_scatter_plot <- gf_point(log10.bmr ~ log10.body.mass | habitat,
color = ~order,
# size = ~parse_number(n_animals),
data = mammal_bmr,
alpha = 0.5) %>%
gf_theme(legend.position = 'bottom',
legend.title = element_text(size = 8),
legend.text = element_text(size = 6)) %>%
gf_theme(scale_color_viridis_d('Order')) %>%
gf_labs(x = 'Log10(Mass (kg))',
y = 'Log10(BMR (kcal/day))')
my_scatter_plot
plotly::ggplotly(my_scatter_plot) %>%
plotly::layout(legend = list(#orientation = 'h',
font = list(size = 6)))
Notes: If we wanted to use for web or publication, we would refine. We can edit what info is shown when you hover over a data point, change color scheme, legend, etc.
Will not account for phylogeny at all in the model structure. Predictors: body mass interacting with habitat, effect of >10 kg, (maybe) diet.
lm_model <- lm(log10.bmr ~ log10.body.mass * habitat +
log10.body.mass*above.10kg + diet,
data = mammal_bmr)
tab_model(lm_model)
| log10.bmr | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 1.86 | 1.68 – 2.05 | <0.001 |
| log10.body.mass | 0.77 | 0.68 – 0.86 | <0.001 |
| habitat [land] | -0.20 | -0.38 – -0.02 | 0.026 |
| above.10kg [Smaller] | 0.02 | -0.12 – 0.16 | 0.774 |
| diet [h] | 0.09 | 0.05 – 0.12 | <0.001 |
| diet [o] | 0.05 | 0.01 – 0.08 | 0.003 |
|
log10.body.mass * habitat [land] |
0.02 | -0.07 – 0.12 | 0.646 |
|
log10.body.mass * above.10kg [Smaller] |
-0.12 | -0.20 – -0.03 | 0.008 |
| Observations | 722 | ||
| R2 / R2 adjusted | 0.963 / 0.962 | ||
lm_anova_results <- car::Anova(lm_model)
pander(lm_anova_results)
| Sum Sq | Df | F value | Pr(>F) | |
|---|---|---|---|---|
| log10.body.mass | 214.9 | 1 | 8570 | 0 |
| habitat | 0.3411 | 1 | 13.61 | 0.0002427 |
| above.10kg | 0.6022 | 1 | 24.02 | 1.183e-06 |
| diet | 0.6796 | 2 | 13.55 | 1.672e-06 |
| log10.body.mass:habitat | 0.005289 | 1 | 0.2109 | 0.6462 |
| log10.body.mass:above.10kg | 0.178 | 1 | 7.098 | 0.007893 |
| Residuals | 17.9 | 714 | NA | NA |
For more on formatting the fitted model table see: https://strengejacke.github.io/sjPlot/articles/tab_model_estimates.html
lm_preds <- predict(lm_model, se.fit = TRUE)
mammal_bmr <- mammal_bmr %>%
mutate(lm_resids = resid(lm_model),
lm_fitted = lm_preds$fit,
lm_fit_lo = lm_preds$fit + 1.96*lm_preds$se.fit,
lm_fit_hi = lm_preds$fit - 1.96*lm_preds$se.fit)
gf_point(lm_resids ~ lm_fitted, data = mammal_bmr)
acf(resid(lm_model))
gf_dhistogram(~lm_resids, data = mammal_bmr,
bins = 20) %>%
gf_fitdistr()
Any predictors not shown in a plot are held constant at their mean or most common value.
gf_line(lm_fitted ~ log10.body.mass,
color = ~habitat,
data = mammal_bmr) %>%
gf_ribbon(lm_fit_lo + lm_fit_hi ~ log10.body.mass,
color = ~habitat, fill = ~habitat) %>%
gf_facet_grid(~ diet) %>%
gf_theme(scale_color_manual(values = my_colors)) %>%
gf_theme(scale_fill_manual(values = my_colors))
gf_point(log10.bmr ~ lm_fitted, data = mammal_bmr,
alpha = 0.1) %>%
gf_labs(y = 'Observed log10(BMR)',
x = 'Model-predicted log10(BMR)',
title = 'Linear Regresssion (no phylogeny)') %>%
gf_abline(slope = 1, intercept = 0, color = 'black', linetype = 'dashed')
Can refine the plot of predictions later on.
Will include nested random effects of order/family/genus/species, expecting similarity of observations based on a hierarchy of phylogenetic relatedness, but not in a specified/structured way; only groupings are used, with no sense of (for example) the fact that two orders are thought to be “further” apart than any other two.
# this code may generate warnings that are harmless
# https://stackoverflow.com/questions/67040472/warning-in-every-model-of-glmmtmb-givecsparse
re_model <- glmmTMB(log10.bmr ~ log10.body.mass * habitat +
log10.body.mass*above.10kg + diet +
(1 | order/genus/species),
data = mammal_bmr)
tab_model(re_model)
| log10.bmr | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 1.83 | 1.65 – 2.00 | <0.001 |
| log10.body.mass | 0.74 | 0.66 – 0.83 | <0.001 |
| habitat [land] | -0.25 | -0.40 – -0.09 | 0.002 |
| above.10kg [Smaller] | 0.06 | -0.06 – 0.19 | 0.332 |
| diet [h] | 0.05 | 0.01 – 0.09 | 0.014 |
| diet [o] | 0.02 | -0.01 – 0.06 | 0.175 |
|
log10.body.mass * habitat [land] |
0.04 | -0.05 – 0.13 | 0.367 |
|
log10.body.mass * above.10kg [Smaller] |
-0.11 | -0.19 – -0.02 | 0.014 |
| Random Effects | |||
| σ2 | 0.00 | ||
| τ00 species:(genus:order) | 0.01 | ||
| τ00 genus:order | 0.01 | ||
| τ00 order | 0.01 | ||
| ICC | 0.97 | ||
| N species | 626 | ||
| N genus | 415 | ||
| N order | 24 | ||
| Observations | 722 | ||
| Marginal R2 / Conditional R2 | 0.950 / 0.999 | ||
re_anova_results <- car::Anova(re_model)
pander(re_anova_results)
| Chisq | Df | Pr(>Chisq) | |
|---|---|---|---|
| log10.body.mass | 5100 | 1 | 0 |
| habitat | 19.95 | 1 | 7.964e-06 |
| above.10kg | 8.212 | 1 | 0.004161 |
| diet | 6.304 | 2 | 0.04277 |
| log10.body.mass:habitat | 0.8136 | 1 | 0.367 |
| log10.body.mass:above.10kg | 5.983 | 1 | 0.01445 |
re_ave_preds <- predict(re_model,
se.fit = TRUE,
re.form = ~0)
re_ind_preds <- predict(re_model,
se.fit = TRUE,
re.form = NULL)
mammal_bmr <- mammal_bmr %>%
mutate(re_resids = resid(re_model),
re_ind_fitted = re_ind_preds$fit,
re_ave_fitted = re_ave_preds$fit,
re_ave_lo = re_ave_preds$fit - 1.96*re_ave_preds$se.fit,
re_ave_hi = re_ave_preds$fit + 1.96*re_ave_preds$se.fit)
# save fitted model and data
saveRDS(mammal_bmr, 'fitted-models/mr-data.RDS')
saveRDS(re_model, 'fitted-models/mr-re-model.RDS')
gf_point(re_resids ~ re_ind_fitted, data = mammal_bmr)
acf(resid(re_model))
gf_dhistogram(~re_resids, data = mammal_bmr) %>%
gf_fitdistr()
mammal_bmr <- mammal_bmr %>%
mutate(pretty_diet = case_when(diet == 'c'~ 'Carnivore',
diet == 'h' ~ 'Herbivore',
diet == 'o' ~ 'Omnivore'),
pretty_diet = factor(pretty_diet, levels = c('Herbivore', 'Omnivore', 'Carnivore')),
Habitat = stringr::str_to_sentence(habitat))
gf_point(log10.bmr ~ log10.body.mass,
color = ~Habitat, data = mammal_bmr,
size = 0.5, alpha = 0.4) %>%
gf_line(re_ave_fitted ~ log10.body.mass,
color = ~Habitat,
data = mammal_bmr) %>%
gf_ribbon(re_ave_lo + re_ave_hi ~ log10.body.mass,
color = ~Habitat, fill = ~Habitat) %>%
gf_facet_grid(~ pretty_diet) %>%
gf_labs(x = 'Log10(Body Mass (kg))', y = 'Log10(BMR)\n(Dots are observed, lines predicted)') %>%
gf_theme(scale_color_manual(values = my_colors)) %>%
gf_theme(scale_fill_manual(values = my_colors))
We wanted a graph of “aquatic vs terrestrial” – is this what we meant? (My notes don’t say any more about exactly what is to be shown…)
gf_point(log10.bmr ~ re_ind_fitted, data = mammal_bmr,
alpha = 0.1) %>%
gf_labs(y = 'Observed log10(BMR)',
x = 'Model-predicted, Species-specific log10(BMR)',
title = 'Mixed-effects Model (RE of Order/Genus/Species)') %>%
gf_abline(slope = 1, intercept = 0, color = 'black', linetype = 'dashed')
The graph above is a bit “cheating” as we have a random effect of species, but there are only 1-2 measurements for most of the species (nearly guaranteeing that our estimates will be nearly perfect). What if we also check out the predictions accounting for the modeled effects of Order and Genus, but predicting for the “average” species in each Genus?
no_species <- mammal_bmr %>%
mutate(species = NA)
re_genus_level_preds <- predict(re_model,
se.fit = TRUE,
re.form = NULL,
newdata = no_species)
gf_point(log10.bmr ~ re_genus_level_preds$fit, data = mammal_bmr,
alpha = 0.1) %>%
gf_labs(y = 'Observed log10(BMR)',
x = 'Model-predicted, Genus-specific log10(BMR)',
title = 'Mixed-effects Model (RE of Order/Genus/Species)') %>%
gf_abline(slope = 1, intercept = 0, color = 'black', linetype = 'dashed')
Weight observations by body mass, binned into log mass bins. The goal is to reduce the undue influence of many data points collected from very small animals.
nr <- nrow(mammal_bmr)
mammal_bmr <- mammal_bmr %>%
mutate(log.mass.bin = cut_width(log10.body.mass,
width = 1,
boundary = log10(0.001))) %>%
group_by(log.mass.bin) %>%
mutate(log.mass.weights = 1 / n())
mammal_bmr <- mammal_bmr %>%
# make weights sum to nrow(mammal_bmr)
mutate(log.mass.weights = log.mass.weights / sum(mammal_bmr$log.mass.weights) * nrow(mammal_bmr)) %>%
ungroup()
# this code may generate warnings that are harmless
# https://stackoverflow.com/questions/67040472/warning-in-every-model-of-glmmtmb-givecsparse
wt_re_model <- glmmTMB(log10.bmr ~ log10.body.mass * habitat +
log10.body.mass*above.10kg + diet +
(1 | order/genus/species),
weights = log.mass.weights,
data = mammal_bmr)
tab_model(wt_re_model)
| log10.bmr | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 1.82 | 1.64 – 1.99 | <0.001 |
| log10.body.mass | 0.75 | 0.67 – 0.83 | <0.001 |
| habitat [land] | -0.25 | -0.40 – -0.09 | 0.002 |
| above.10kg [Smaller] | 0.06 | -0.06 – 0.19 | 0.321 |
| diet [h] | 0.06 | 0.02 – 0.10 | 0.002 |
| diet [o] | 0.04 | 0.00 – 0.07 | 0.030 |
|
log10.body.mass * habitat [land] |
0.04 | -0.05 – 0.12 | 0.394 |
|
log10.body.mass * above.10kg [Smaller] |
-0.11 | -0.19 – -0.02 | 0.012 |
| Random Effects | |||
| σ2 | 0.00 | ||
| τ00 species:(genus:order) | 0.01 | ||
| τ00 genus:order | 0.01 | ||
| τ00 order | 0.01 | ||
| ICC | 0.99 | ||
| N species | 626 | ||
| N genus | 415 | ||
| N order | 24 | ||
| Observations | 722 | ||
| Marginal R2 / Conditional R2 | 0.951 / 1.000 | ||
wt_re_anova_results <- car::Anova(wt_re_model)
pander(wt_re_anova_results)
| Chisq | Df | Pr(>Chisq) | |
|---|---|---|---|
| log10.body.mass | 5091 | 1 | 0 |
| habitat | 21.38 | 1 | 3.775e-06 |
| above.10kg | 8.575 | 1 | 0.003408 |
| diet | 9.461 | 2 | 0.008821 |
| log10.body.mass:habitat | 0.7255 | 1 | 0.3943 |
| log10.body.mass:above.10kg | 6.298 | 1 | 0.01209 |
Only include species with body mass of 10 kg or more.
big_mammal_bmr <- mammal_bmr %>%
filter(log10.body.mass >= log10(10)) %>%
mutate(across(where(is.factor), factor))
# this code may generate warnings that are harmless
# https://stackoverflow.com/questions/67040472/warning-in-every-model-of-glmmtmb-givecsparse
big_re_model <- glmmTMB(log10.bmr ~ log10.body.mass * habitat + diet +
(1 | order/genus/species),
data = big_mammal_bmr)
tab_model(big_re_model)
| log10.bmr | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 1.91 | 1.65 – 2.17 | <0.001 |
| log10.body.mass | 0.68 | 0.58 – 0.78 | <0.001 |
| habitat [land] | -0.39 | -0.63 – -0.14 | 0.002 |
| diet [h] | 0.03 | -0.11 – 0.16 | 0.720 |
| diet [o] | -0.04 | -0.15 – 0.07 | 0.467 |
|
log10.body.mass * habitat [land] |
0.12 | -0.01 – 0.25 | 0.080 |
| Random Effects | |||
| σ2 | 0.01 | ||
| τ00 species:(genus:order) | 0.00 | ||
| τ00 genus:order | 0.00 | ||
| τ00 order | 0.04 | ||
| ICC | 0.87 | ||
| N species | 58 | ||
| N genus | 52 | ||
| N order | 12 | ||
| Observations | 60 | ||
| Marginal R2 / Conditional R2 | 0.823 / 0.977 | ||
big_re_anova_results <- car::Anova(big_re_model)
pander(big_re_anova_results)
| Chisq | Df | Pr(>Chisq) | |
|---|---|---|---|
| log10.body.mass | 387.9 | 1 | 2.39e-86 |
| habitat | 14.08 | 1 | 0.0001755 |
| diet | 1.113 | 2 | 0.5732 |
| log10.body.mass:habitat | 3.07 | 1 | 0.07973 |
#Read in trees from Upham et al
tree_path <- paste("data/upham-trees/Completed_5911sp_topoCons_FBDasZhouEtAl")
tree_files <- list.files(tree_path)
all_trees <- list()
for (i in 1:length(tree_files)){
all_trees[[i]] <- read.tree(paste0(tree_path, '/',
tree_files[i]))
if (i == 1){
treeset <- all_trees[[i]]
}else{
treeset <- c(treeset, all_trees[[i]])
}
}
all_tip_labels <- purrr::map(treeset, "tip.label")
all_tip_labels <- purrr::map(all_tip_labels,
function(x)
stringr::str_replace_all(x, pattern = '_',
replacement = ' '))
# get list of species that are in ALL the trees
for (t in 1:length(all_tip_labels)){
if (t == 1){
tip_labs <- all_tip_labels[[t]]
}else{
tip_labs <- intersect(tip_labs, all_tip_labels[[t]])
}
}
# keep only the species in mammal_bmr that are in all the trees
# on 4/14 this removes 1 species.
pgls_data <- mammal_bmr %>%
mutate(animal = as.character(animal)) %>%
filter(animal %in% tip_labs) %>%
droplevels()
taxonomy <- read_csv('data/upham-trees/taxonomy_mamPhy_5911species.csv')
Fit models, one model for every tree in our list.
pgls_models <- list()
for (t in c(1:length(treeset))){
# make sure there is only one row of data per species (why sample and not average -- seems dubious??)
pgls_rep_data <- pgls_data %>%
group_by(animal) %>%
sample_n(1) %>%
ungroup
#Reduce the tree to only include those species in the data set
refit_tree <- treeset[[t]]
refit_tree$tip.label <- str_replace_all(refit_tree$tip.label, '_', ' ')
refit_tree <- drop.tip(refit_tree,
setdiff(refit_tree$tip.label,
unique(pgls_rep_data %>% pull(animal))))
#Order the data set so that it is in the same order as the tip labels of the tree
pgls_rep_data <- left_join(data.frame(tree.tip.label = refit_tree$tip.label),
pgls_rep_data,
by = c('tree.tip.label' = 'animal'),
keep = TRUE)
# fit the model
pgls_models[[t]] <- tryCatch({
fittd <- gls(log10.bmr ~ log10.body.mass * habitat +
log10.body.mass * above.10kg + diet ,
correlation = corPagel(value = 0.8,
phy = refit_tree,
fixed = FALSE,
form = ~animal),
data = pgls_rep_data)
},
error = function(cond){
message(paste('PGLS fit failed for tree', t))
return(NULL)
}
)
}
pgls_models <- pgls_models[-which(lapply(pgls_models, is.null) == T)]
Note: we tried to fit 101 PGLS models (each with a different tree); of these, model fitting failed for 101.
Combine the many PGLS model runs together into one summary combined model according to Rubin’s rule.
# as.mira takes the list of models and create an object to be used by the mice package
pgls_mira <- as.mira(pgls_models)
# # pool summarise the models using Rubin's rule corrected for small samples
pooled_pgls <- pool(pgls_mira)
pooled_pgls_summ <- summary(pooled_pgls, type = 'all', conf.int = TRUE)
pander(pooled_pgls_summ %>% select(term, estimate, std.error, `2.5 %`, `97.5 %`, lambda, fmi))
| term | estimate | std.error | 2.5 % |
|---|---|---|---|
| (Intercept) | 1.694 | 0.1617 | 1.377 |
| log10.body.mass | 0.7698 | 0.03897 | 0.6933 |
| habitatland | -0.1655 | 0.07023 | -0.3034 |
| above.10kgSmaller | 0.05458 | 0.05738 | -0.05809 |
| dieth | 0.005226 | 0.02093 | -0.03588 |
| dieto | -0.009785 | 0.01762 | -0.04438 |
| log10.body.mass:habitatland | 0.01167 | 0.04312 | -0.07299 |
| log10.body.mass:above.10kgSmaller | -0.07717 | 0.03831 | -0.1524 |
| 97.5 % | lambda | fmi |
|---|---|---|
| 2.012 | 0.001913 | 0.004746 |
| 0.8463 | 0.00118 | 0.004013 |
| -0.02758 | 0.002059 | 0.004891 |
| 0.1672 | 0.002718 | 0.005551 |
| 0.04633 | 0.008937 | 0.01177 |
| 0.02481 | 0.007298 | 0.01013 |
| 0.09634 | 0.000957 | 0.00379 |
| -0.001944 | 0.001083 | 0.003916 |
pgls_anova <- Anova(pgls_mira)
pander(pgls_anova)
| F | num df | den df | missing info | |
|---|---|---|---|---|
| log10.body.mass | 4970 | 1 | 5931857 | 0.003943 |
| habitat | 10.75 | 1 | 10557358 | 0.002955 |
| above.10kg | 3.735 | 1 | 2196269 | 0.00648 |
| diet | 0.6599 | 2 | 5443751 | 0.005941 |
| log10.body.mass:habitat | 0.07329 | 1 | 100675163 | 0.000957 |
| log10.body.mass:above.10kg | 4.056 | 1 | 78576839 | 0.001083 |
| riv | Pr(>F) | |
|---|---|---|
| log10.body.mass | 0.003958 | 0 |
| habitat | 0.002964 | 0.001043 |
| above.10kg | 0.006522 | 0.05328 |
| diet | 0.005976 | 0.5169 |
| log10.body.mass:habitat | 0.0009579 | 0.7866 |
| log10.body.mass:above.10kg | 0.001084 | 0.044 |
Alternative approach: using MuMIn to do model averaging. This will weight models according to their information criteria scores instead of weighting them all equally. But we see here the results are nearly identical.
pgls_avg <- model.avg(pgls_models,
rank = function(x) 1)
This gives a model with the same coefficients (same model) that I think we can better make predictions from. The other way (with pool()) is better for getting the ANOVA results we wanted. Two ways of getting the same result, in different packages.
For the PGLS model, we also want to extract the estimate of \(\Lambda\), which tells us about how the phylogeny is affecting the correlation structure.
Previous approach was to take the mean of the estimates of \(\Lambda\) from all the individual fitted models.
lambda <- mean(unlist(purrr::map(pgls_models,
function(x) x$modelStruct$corStruct)))
lambda
## [1] 0.8891128
According to this simple method our estimate is \(\hat{\Lambda} =\) 0.889.
It’s not really clear how to even approach this, since we don’t really any longer expect the residuals to be normal or independent. And based on the data we know there’s not a huge issue with linearity. So…ok?
There’s a problem here in that the “pooled” model object does NOT have any information stored anymore about the variance-covariance matrix, which is key for making predictions from the model. So for making predictions we need to use package MuMIn and its model.avg() function, just weighting equally all the models from all the trees. This results in a fitted single model object from which prediction is possible with predict(). From initial inspection the coefficient estimates of the averaged model are the same as the previous version averaged with the mice package.
pgls_preds <- predict(pgls_avg,
se.fit = TRUE,
newdata = mammal_bmr)
mammal_bmr <- mammal_bmr %>%
mutate(pgls_fitted = pgls_preds$fit,
pgls_lo = pgls_preds$fit - 1.96*pgls_preds$se.fit,
pgls_hi = pgls_preds$fit + 1.96*pgls_preds$se.fit
)
gf_line(pgls_fitted ~ log10.body.mass,
color = ~habitat,
data = mammal_bmr) %>%
gf_ribbon(pgls_lo + pgls_hi ~ log10.body.mass,
color = ~habitat, fill = ~habitat) %>%
gf_facet_grid(~ diet) %>%
gf_theme(scale_color_manual(values = my_colors)) %>%
gf_theme(scale_fill_manual(values = my_colors))
gf_point(log10.bmr ~ pgls_fitted,
data = mammal_bmr,
alpha = 0.1) %>%
gf_labs(y = 'Observed log10(BMR)',
x = 'Model-predicted log10(BMR)',
title = 'PGLS Model') %>%
gf_abline(slope = 1, intercept = 0, color = 'black', linetype = 'dashed')
Notice that the CIs are wider here than in the original linear regression model. This is because of appropriate incorporation of the fact that the observations are not independent and there’s phylogenetic signal in the data.
As far as I know though, there is not a way to make predictions for specific species or taxa that are more accurate and account for the “way” in which the species/taxon tends to deviate from the overall average. Instead, our overall uncertainty is adjusted to account for the dependence that we know is in the data. But we can’t (I don’t think) decompose it into a random part and a phylogeny part like we can with the random effects; they are both mixed inextricably together.
We could consider grouping by by body-mass ranges or phylogeny and then assess predictive accuracy as the measure of “success”
This remains to do.
Want to repeat a similar analysis with other datasets on other response variables of interest in a “physiological hierarchy”:
All these are other candidate response variables. We started with metabolism, and want to move on to these other metrics.
Main predictor of interest is habitat, controlling for body size and phylogeny: Are aquatic/marine animals different in some notable way?
Data sets for these other variables are smaller (about n = 50 species).